
    Iterative parameter mixing for distributed large-margin training of structured predictors for natural language processing

    The development of distributed training strategies for statistical prediction functions is important for applications of machine learning generally, and the development of distributed structured prediction training strategies is important for natural language processing (NLP) in particular. With ever-growing data sets, this is true first because it is easier to increase computational capacity by adding more processor nodes than by increasing the power of individual processor nodes, and second because data sets are often collected and stored in different locations. Iterative parameter mixing (IPM) is a distributed training strategy in which each node in a network of processors optimizes a regularized average loss objective on its own subset of the total available training data, making stochastic (per-example) updates to its own estimate of the optimal weight vector and communicating with the other nodes by periodically averaging estimates of the optimal vector across the network. This algorithm has been contrasted with a close relative, called here the single-mixture optimization algorithm, in which each node stochastically optimizes an average loss objective on its own subset of the training data, operating in isolation until convergence, at which point the average of the independently created estimates is returned. Recent empirical results have suggested that this IPM strategy produces better models than the single-mixture algorithm, and the results of this thesis add to this picture.
    The contributions of this thesis are as follows. The first contribution is to produce and analyze an algorithm for decentralized stochastic optimization of regularized average loss objective functions. This algorithm, which we call the distributed regularized dual averaging algorithm, improves over prior work on distributed dual averaging by providing a simpler algorithm (used in the rest of the thesis), better convergence bounds for the case of regularized average loss functions, and certain technical results that are used in the sequel. The central contribution of this thesis is to give an optimization-theoretic justification for the IPM algorithm. While past work has focused primarily on its empirical test-time performance, we give a novel perspective on this algorithm by showing that, in the context of the distributed dual averaging algorithm, IPM constitutes a convergent optimization algorithm for arbitrary convex functions, whereas the single-mixture algorithm does not. Experiments indeed confirm that the superior test-time performance of models trained using IPM, compared to single-mixture, correlates with better optimization of the objective value on the training set, a fact not previously reported. Furthermore, our analysis of general non-smooth functions justifies the use of distributed large-margin (support vector machine [SVM]) training of structured predictors, which we show yields better test performance than the IPM perceptron algorithm, the only version of IPM previously given a theoretical justification. Our results confirm that IPM training can reach the same level of test performance as a sequentially trained model and can reach better accuracies under a fixed budget of training time. Finally, we use the reduction in training time that distributed training allows to experiment with adding higher-order dependency features to a state-of-the-art phrase-structure parsing model. We demonstrate that adding these features improves the out-of-domain parsing results of even the strongest phrase-structure parsing models, yielding a new state of the art for the popular train-test pairs considered. In addition, we show that a feature-bagging strategy, in which component models are trained separately and later combined, is sometimes necessary to avoid feature under-training and get the best performance out of large feature sets.
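    As a rough illustration of the two training regimes contrasted above, the sketch below implements iterative parameter mixing and single-mixture averaging for a simple L2-regularized hinge-loss (linear SVM) objective. The thesis itself works with structured predictors and a dual averaging update, so the local update rule, learning rate, and shard data here are placeholder assumptions, not the thesis's algorithm.

        import numpy as np

        def local_epoch(w, X, y, lam=0.01, lr=0.1):
            # One stochastic pass over a node's local shard: per-example
            # subgradient steps on an L2-regularized hinge loss.
            for xi, yi in zip(X, y):
                grad = lam * w
                if yi * w.dot(xi) < 1.0:
                    grad = grad - yi * xi
                w = w - lr * grad
            return w

        def iterative_parameter_mixing(shards, dim, epochs=10):
            # IPM: every node trains for one epoch from the shared weights,
            # then the per-node estimates are averaged and redistributed.
            w = np.zeros(dim)
            for _ in range(epochs):
                local = [local_epoch(w.copy(), X, y) for X, y in shards]
                w = np.mean(local, axis=0)   # periodic mixing step
            return w

        def single_mixture(shards, dim, epochs=10):
            # Single mixture: each node trains in isolation for all epochs,
            # and the independent estimates are averaged only once at the end.
            finals = []
            for X, y in shards:
                w = np.zeros(dim)
                for _ in range(epochs):
                    w = local_epoch(w, X, y)
                finals.append(w)
            return np.mean(finals, axis=0)

        # shards is a list of (X, y) pairs, one per processor node, e.g.
        # w = iterative_parameter_mixing(shards, dim=num_features)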

    Analysis of shared common genetic risk between amyotrophic lateral sclerosis and epilepsy

    Because hyper-excitability has been shown to be a pathophysiological mechanism shared by amyotrophic lateral sclerosis and epilepsy, we used the latest and largest genome-wide studies in amyotrophic lateral sclerosis (n = 36,052) and epilepsy (n = 38,349) to determine the genetic overlap between these conditions. First, we found no significant genetic correlation, including when variants were binned by minor allele frequency. Second, we confirmed the absence of polygenic overlap using genomic risk score analysis. Finally, we did not identify pleiotropic variants in meta-analyses of the two diseases. Our findings indicate that amyotrophic lateral sclerosis and epilepsy do not share common genetic risk, showing that hyper-excitability in both disorders has distinct origins.
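    The abstract names but does not describe the analyses used. As a minimal sketch of what a genomic (polygenic) risk score computes, the toy example below scores individuals as the effect-size-weighted sum of their risk-allele dosages; all numbers are invented, and the consortium's actual pipeline is more involved. Testing whether scores built from one disease's genome-wide study predict case status in the other disease is the essence of the polygenic-overlap check referred to above.

        import numpy as np

        def genomic_risk_score(dosages, effect_sizes):
            # Score = sum over variants of (risk-allele dosage x GWAS effect size).
            return dosages @ effect_sizes

        # Hypothetical toy data: 4 individuals x 3 variants; dosages count
        # copies (0, 1, 2) of each risk allele, and effects stand in for
        # log odds ratios from the discovery study of the first disease.
        dosages = np.array([[0, 1, 2],
                            [1, 1, 0],
                            [2, 0, 1],
                            [1, 2, 2]], dtype=float)
        effects = np.array([0.05, -0.02, 0.10])
        print(genomic_risk_score(dosages, effects))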

    Genomic Relationships, Novel Loci, and Pleiotropic Mechanisms across Eight Psychiatric Disorders

    Genetic influences on psychiatric disorders transcend diagnostic boundaries, suggesting substantial pleiotropy of contributing loci. However, the nature and mechanisms of these pleiotropic effects remain unclear. We performed analyses of 232,964 cases and 494,162 controls from genome-wide studies of anorexia nervosa, attention-deficit/hyperactivity disorder, autism spectrum disorder, bipolar disorder, major depression, obsessive-compulsive disorder, schizophrenia, and Tourette syndrome. Genetic correlation analyses revealed a meaningful structure within the eight disorders, identifying three groups of interrelated disorders. Meta-analysis across these eight disorders detected 109 loci associated with at least two psychiatric disorders, including 23 loci with pleiotropic effects on four or more disorders and 11 loci with antagonistic effects on multiple disorders. The pleiotropic loci are located within genes that show heightened expression in the brain throughout the lifespan, beginning prenatally in the second trimester, and play prominent roles in neurodevelopmental processes. These findings have important implications for psychiatric nosology, drug development, and risk prediction.
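    As a hedged sketch of the cross-disorder meta-analytic step described above, the snippet below pools one variant's per-disorder effect estimates with standard inverse-variance fixed-effects weighting; the consortium's actual method additionally handles sample overlap between studies, and all numbers here are invented.

        import numpy as np

        def fixed_effects_meta(betas, ses):
            # Inverse-variance-weighted pooling of one SNP's effect estimates.
            betas, ses = np.asarray(betas, float), np.asarray(ses, float)
            w = 1.0 / ses ** 2
            beta = np.sum(w * betas) / np.sum(w)
            se = np.sqrt(1.0 / np.sum(w))
            return beta, se, beta / se   # pooled effect, its SE, z-score

        # Hypothetical per-disorder estimates for a single variant.
        print(fixed_effects_meta(betas=[0.04, 0.03, 0.05, 0.02],
                                 ses=[0.010, 0.012, 0.015, 0.011]))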

    Finishing the euchromatic sequence of the human genome

    The sequence of the human genome encodes the genetic instructions for human physiology, as well as rich information about human evolution. In 2001, the International Human Genome Sequencing Consortium reported a draft sequence of the euchromatic portion of the human genome. Since then, the international collaboration has worked to convert this draft into a genome sequence with high accuracy and nearly complete coverage. Here, we report the result of this finishing process. The current genome sequence (Build 35) contains 2.85 billion nucleotides interrupted by only 341 gaps. It covers approximately 99% of the euchromatic genome and is accurate to an error rate of approximately 1 event per 100,000 bases. Many of the remaining euchromatic gaps are associated with segmental duplications and will require focused work with new methods. The near-complete sequence, the first for a vertebrate, greatly improves the precision of biological analyses of the human genome, including studies of gene number, birth and death. Notably, the human genome seems to encode only 20,000-25,000 protein-coding genes. The genome sequence reported here should serve as a firm foundation for biomedical research in the decades ahead.

    Genomic reconstruction of the SARS-CoV-2 epidemic in England.

    The evolution of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus leads to new variants that warrant timely epidemiological characterization. Here we use the dense genomic surveillance data generated by the COVID-19 Genomics UK Consortium to reconstruct the dynamics of 71 different lineages in each of 315 English local authorities between September 2020 and June 2021. This analysis reveals a series of subepidemics that peaked in early autumn 2020, followed by a jump in transmissibility of the B.1.1.7/Alpha lineage. The Alpha variant grew while other lineages declined during the second national lockdown and regionally tiered restrictions between November and December 2020. A third, more stringent national lockdown suppressed the Alpha variant and eliminated nearly all other lineages in early 2021. Yet a series of variants (most of which contained the spike E484K mutation) defied these trends and persisted at moderately increasing proportions. However, by accounting for sustained introductions, we found that the transmissibility of these variants is unlikely to have exceeded that of the Alpha variant. Finally, B.1.617.2/Delta was repeatedly introduced in England and grew rapidly in early summer 2021, constituting approximately 98% of sampled SARS-CoV-2 genomes on 26 June 2021.
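    The study fits the joint dynamics of many lineages across local authorities; as a much simpler illustration of what a lineage growth advantage (a "jump in transmissibility") means, the sketch below propagates the expected share of one variant competing against all others under a constant logistic growth advantage. The parameter values are invented and this two-lineage model is not the paper's method.

        import numpy as np

        def lineage_share(t_days, s_per_day, p0):
            # Two-lineage logistic competition: a variant starting at share p0
            # whose odds against the resident lineages grow by exp(s) per day.
            odds = p0 / (1.0 - p0) * np.exp(s_per_day * t_days)
            return odds / (1.0 + odds)

        # Hypothetical: a variant at 1% of sampled genomes with a 7%/day advantage.
        for t in range(0, 91, 15):
            print(t, round(float(lineage_share(t, 0.07, 0.01)), 3))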

    Logic and the comprehension of language

    This thesis examines what is necessary to formally model a hearer's comprehension of a natural language sentence. Our theory of comprehension should at least explain how different words within the same grammatical class make different contributions to the meaning of a sentence. It should also explain how the "full propositional form" that a speaker communicates is recovered from the relatively semantically underspecified acoustic signal. A model is provided which achieves this. A hearer is said to understand an utterance by, first, choosing the maximally "relevant" full propositional semantic enrichment of the underspecified acoustic signal, measured according to a formally defined comparison operator, and, then, computing the inferences that follow from that chosen propositional form in conjunction with their individual word-/world-knowledge. This model of comprehension apparently makes comprehension relative to an individual's idiosyncratic knowledge. So, I also discuss how conventionalized word-meanings co-ordinate individuals' knowledge to allow successful interpersonal communication.
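    As a purely schematic sketch of the comprehension procedure described above, the snippet below selects the candidate enrichment that is maximal under a supplied relevance ordering and then closes it under inference with the hearer's knowledge. Every function passed in (enrichments, relevance, infer) is a hypothetical placeholder for machinery the thesis defines formally, not an implementation of it.

        def comprehend(signal, enrichments, relevance, infer, knowledge):
            # Step 1: enumerate full propositional enrichments of the
            # underspecified signal and keep the maximally relevant one.
            candidates = enrichments(signal)
            chosen = max(candidates, key=lambda p: relevance(p, knowledge))
            # Step 2: compute what follows from the chosen proposition
            # together with the hearer's word-/world-knowledge.
            return chosen, infer({chosen} | knowledge)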
